Analyzing Gensini Score as a Semi-Continuous Outcome.

Background: Investigators frequently encounter continuous outcomes with many values clumped at zero, known as semi-continuous outcomes. The Gensini score, one of the most widely used scoring systems for expressing coronary angiographic results, is of this type. The aim of this study was to apply two statistical approaches, one based on categorization and one based on the original scale of the Gensini score, to simultaneously assess the association between covariates and the presence and severity of coronary artery disease (CAD). Methods: We considered the data on 1594 individuals admitted to Tehran Heart Center with CAD symptoms from July 2004 to February 2008. The participants' baseline demographic and clinical characteristics were collected, and their coronary angiographic results were expressed through the Gensini score. The generalized ordinal threshold and two-part models were applied for the statistical analyses. Results: In total, 320 (20.1%) individuals had a Gensini score of zero. Neither the two-part model nor the generalized ordinal threshold model showed a significant association between Factor V Leiden and the occurrence of CAD. However, based on the two-part model, Factor V Leiden was associated with the severity of CAD, such that the Gensini score increased by moving from a wild genotype to a heterozygote (β = 0.44; 95% CI: 0.20-0.69 in logarithm scale) or a homozygote mutant (β = 0.70; 95% CI: 0.28-1.12 in logarithm scale). The proportional odds assumption was not met in our data (χ² = 54.26; p value < 0.001); however, a trend toward severe CAD was also observed at each category of the Gensini score using the generalized ordinal threshold model. Conclusion: We conclude that, besides the loss of information caused by categorizing a semi-continuous outcome, violation of the proportional odds assumption complicates the final decision, especially for clinicians. Therefore, more straightforward models such as the two-part model should receive more attention when analyzing such outcomes.


Introduction
Investigators frequently confront continuous outcomes in which a large number of observations are equal to zero. In the literature, these types of variables are called "semicontinuous", "zero-inflated continuous", and/or "clumped at zero" data. 1 Semi-continuous outcomes are common in medical, economic, and ecological studies. In cardiovascular medicine, the Gensini score, one of the most widely used scoring systems, 2 is an example where a zero value indicates no luminal stenosis within the coronary artery tree, representing patients without coronary artery disease (non-CAD group). 3 This is a complete and useful, but not ideal, scoring system developed by Gensini, 3 which places more emphasis on the severity of CAD. It takes into account the geographical location and degree of luminal narrowing as well as the cumulative effect of multiple obstructions. The severity of stenosis is indicated by the reduction in lumen diameter, and a nonlinear score is assigned to each lesion based upon it. Then, according to the functional importance of the area of each lesion in the coronary tree, a multiplier is applied. The Gensini score is the sum of the lesion scores. Although the Gensini score provides a quantitative variable, it is rarely used quantitatively in statistical analyses. The reason is that it is a semi-continuous outcome with a relatively large number of observed values clustered at zero: it cannot be expressed through a single distribution, and the right-skewed non-zero values cannot be transformed to normality.
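The scoring rule described above (a nonlinear severity score per lesion, multiplied by a weight for the segment, then summed) can be sketched in a few lines. This is a minimal illustration only: the severity scores and segment multipliers below are values commonly cited for the Gensini system and are not stated in this article, the segment list is partial, and the names `SEVERITY_SCORE`, `SEGMENT_MULTIPLIER`, and `gensini_score` are hypothetical. The values should be checked against reference 3 before any real use.

```python
# Severity score by percent reduction in lumen diameter.
# ASSUMPTION: commonly cited Gensini values, not given in this article.
SEVERITY_SCORE = {25: 1, 50: 2, 75: 4, 90: 8, 99: 16, 100: 32}

# Multiplier for the functional importance of the segment.
# ASSUMPTION: commonly cited values; this is a partial, illustrative list.
SEGMENT_MULTIPLIER = {
    "left_main": 5.0,
    "proximal_lad": 2.5,
    "proximal_lcx": 2.5,
    "mid_lad": 1.5,
    "rca": 1.0,
    "distal_lad": 1.0,
}


def gensini_score(lesions):
    """Sum of (severity score x segment multiplier) over all lesions.

    lesions: iterable of (segment_name, stenosis_grade) pairs, where
    stenosis_grade is one of the keys of SEVERITY_SCORE.
    An empty list (no luminal stenosis) gives a score of 0.
    """
    return sum(
        SEVERITY_SCORE[grade] * SEGMENT_MULTIPLIER[segment]
        for segment, grade in lesions
    )


# Hypothetical patient: 75% stenosis in the proximal LAD, 50% in the RCA.
example = [("proximal_lad", 75), ("rca", 50)]
print(gensini_score(example))  # 4 * 2.5 + 2 * 1.0
print(gensini_score([]))       # patient without any lesion
```

A patient with no lesions receives exactly zero, which is the source of the clump at zero, while the multiplicative weights stretch the non-zero scores into a long right tail.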
A simple and common practice is to recode the Gensini score into a dichotomous variable (CAD or no CAD) or to sort it into an ordinal variable using specific cut points. Nonetheless, such an approach leads to loss of information. Alternatively, the two-part approach uses one equation (a generalized linear model, usually with a probit or logit link function) to model the likelihood of having a non-zero value and a second equation (ordinary linear regression) to model the values greater than zero. The two-part model is conceptually attractive, but it was developed in econometrics in the early 1980s. 4 Other statistical methods for modeling semi-continuous data, including Tobit 5 and Heckman sample selection 6-8 models, were also originally described in the econometric literature. The application of these models has scarcely been tested in medical studies. Hence, we approached the Gensini score as a semi-continuous outcome from a statistical rather than a clinical standpoint and applied two approaches, one based on categorization and one based on the Gensini score in its original scale, to simultaneously assess the association between covariates and the presence and severity of CAD.
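The two equations of the two-part approach can be written compactly as follows. The notation is introduced here for illustration and does not appear in the original article: Y_i is the Gensini score of individual i, x_i the covariate vector, and α and β the coefficient vectors of the occurrence and severity parts. The logit link and the log transformation are the choices described in the Methods; the normal error term is the usual assumption for the linear part and is stated here as an assumption.

```latex
\begin{align}
\operatorname{logit} \Pr(Y_i > 0 \mid \mathbf{x}_i)
  &= \mathbf{x}_i^{\top} \boldsymbol{\alpha}, \\
\log(Y_i) \mid (Y_i > 0,\ \mathbf{x}_i)
  &= \mathbf{x}_i^{\top} \boldsymbol{\beta} + \varepsilon_i,
  \qquad \varepsilon_i \sim N(0, \sigma^2).
\end{align}
```

Because α and β are separate parameters, a covariate may affect the occurrence of a non-zero score and its magnitude differently, or affect only one of the two.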

Methods
In this cross-sectional study, we used data from a published study on the association of Factor V Leiden with the presence and severity of CAD. The details of the data collection procedure and the participants were previously described. 9 Briefly, a total of 1594 individuals with symptoms related to CAD who were admitted to Tehran Heart Center (Tehran, Iran) for elective coronary angiography between July 2004 and February 2008 were included. Coronary angiography was performed via the percutaneous femoral approach using standard angiographic techniques, and the severity of CAD was expressed with the well-known Gensini score. 3 The participants' baseline demographic and clinical characteristics (including age, sex, body mass index [BMI], smoking status, family history of CAD, diabetes, hypertension, hyperlipidemia, creatinine, history of renal failure, left ventricular ejection fraction [LVEF], and Factor V Leiden) were collected after obtaining written informed consent. Briefly, current smokers were those who smoked any kind of tobacco daily or had quit smoking < 1 month earlier, and any proven CAD in a parent or sibling (under 55 and 65 years for men and women, respectively) was considered a positive family history of CAD. The genotype analysis for Factor V Leiden was performed using polymerase chain reaction-based restriction fragment length polymorphism (PCR-RFLP). The local ethics committee approved the study protocol.
After describing the data, we applied two approaches to analysis: the ordinal threshold model based on the categorization and the two-part model based on the original scale of the Gensini score.
In the ordinal threshold model, a semi-continuous outcome is grouped into a number of ordered categories so that the first category contains the zero outcomes and cut points are selected to define the other categories. 10 Accordingly, we used the tertiles of the non-zero Gensini scores as cut points and applied the logit link function. The relationship between Factor V Leiden and the Gensini score was assessed in unadjusted and adjusted models. Covariates with p values < 0.2 in the univariable analysis were considered in the adjusted model. Although the cumulative odds ratios (ORs) with 95% confidence intervals (CIs) are presented in the results, we found that the crucial proportional odds (PO) assumption, which requires the relationship between the cumulative probabilities of the ordinal outcome categories and the covariates to be the same for each category of the outcome, was not met in our data. Therefore, the generalized ordered logit model, which provides different estimates at each category of the Gensini score, was fitted afterward.
J Teh Univ Heart Ctr 11 (2) http://jthc.tums.ac.ir April 13, 2016
The two-part model considers a semi-continuous outcome as a mixture of two parts: "occurrence or binary" and "intensity (severity) or continuous". 4 With respect to our data, this means that there are two processes: one governs the occurrence of CAD (non-zero vs. zero Gensini score), and the other governs the severity of CAD (positive values of the Gensini score) conditional on the occurrence of CAD. The logit link function was used in the occurrence part, and the logarithm of the Gensini score was modeled in the severity part to handle the skewness of the non-zero Gensini scores. To investigate the relationship between Factor V Leiden and the Gensini score, the effects of covariates with p values < 0.2 in the univariable analysis of each part were adjusted for in a multivariable model. The effects of the covariates on the occurrence of CAD were reported using ORs with 95% CIs, and their effects on the severity of CAD were presented through β estimates with 95% CIs.
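The two processes of the two-part model can be illustrated with a deliberately simplified sketch. It assumes a single binary covariate and no adjustment, in which case the logistic regression OR and the log-scale regression slope both have closed forms; the adjusted models in this study involve several covariates and would need iterative fitting in statistical software. The function name `two_part_binary` and the toy data are hypothetical and are not taken from the study.

```python
import math


def two_part_binary(x, y):
    """Two-part estimates for one binary covariate x (0/1) and outcome y >= 0.

    Part 1 (occurrence): odds ratio of y > 0 for x = 1 versus x = 0.
    Part 2 (severity): difference in mean log(y) among observations with y > 0.
    With a single binary covariate these equal the logistic regression OR and
    the least squares slope on the log scale, so no iteration is needed.
    Assumes each covariate level has at least one zero and one positive value.
    """
    log_pos = {0: [], 1: []}   # log of positive outcomes, by covariate level
    n_zero = {0: 0, 1: 0}      # number of zero outcomes, by covariate level
    for xi, yi in zip(x, y):
        if yi > 0:
            log_pos[xi].append(math.log(yi))
        else:
            n_zero[xi] += 1

    # Occurrence part: odds of a non-zero outcome in each group.
    odds = {g: len(log_pos[g]) / n_zero[g] for g in (0, 1)}
    odds_ratio = odds[1] / odds[0]

    # Severity part: difference in mean log outcome among the positives.
    mean_log = {g: sum(log_pos[g]) / len(log_pos[g]) for g in (0, 1)}
    beta = mean_log[1] - mean_log[0]
    return odds_ratio, beta


# Toy data (hypothetical, not from the study).
x = [0, 0, 0, 0, 1, 1, 1, 1]
y = [0, 0, 2.0, 8.0, 0, 4.0, 16.0, 64.0]
odds_ratio, beta = two_part_binary(x, y)
print(odds_ratio, beta)
```

The zeros enter only the occurrence part, and the severity part is estimated from the non-zero observations alone, which is what allows a covariate to show no association with occurrence while still being associated with severity.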

Results
The median age of the 1594 participants was 58 years (1st quartile = 51, and 3rd quartile = 66), and 1022 (64.1%) were male. The individuals' baseline characteristics are shown in Table 1. The median of the Gensini score was 30.5 (min = 0, and max = 450), and 320 (20.1%) individuals had a zero score (non-CAD group). The histogram of the Gensini score is depicted in Figure 1. The relatively high fraction of zeros and the skewness are obvious in this figure.
Ordinal threshold model: The Gensini score cut points for the non-zero values were 24.5 (33.3rd percentile) and 68 (66.7th percentile). Therefore, 0, > 0 to ≤ 24.5, > 24.5 to ≤ 68, and > 68 were considered as Gensini score groups. The univariable cumulative logistic regression models revealed that all the covariates except hypertension (p value = 0.903) had a statistically significant relation with the Gensini score and were considered as potential confounders in the association between Factor V Leiden and the Gensini score. The results of the adjusted ordinal threshold model are shown in Table 2. However, we found that the proportional odds assumption was violated in our data (χ² = 54.26, df = 22; p value < 0.001). Therefore, the relationship between Factor V Leiden and the Gensini score was not the same in the different classes of the Gensini score, and the cumulative ORs of 2.05 and 4.62 for a heterozygote and a homozygote mutant, respectively, relative to a wild genotype could not be considered as unique estimates.
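The grouping used for the ordinal threshold model can be expressed as a small function. The cut points 24.5 and 68 are the values reported above; the function name `gensini_category` and the coding of the groups as 0 to 3 are assumptions made for illustration.

```python
def gensini_category(score, cut_points=(24.5, 68.0)):
    """Map a Gensini score to an ordered category coded 0..3.

    0: score == 0
    1: 0 < score <= 24.5
    2: 24.5 < score <= 68
    3: score > 68
    """
    if score < 0:
        raise ValueError("Gensini score cannot be negative")
    if score == 0:
        return 0
    lower, upper = cut_points
    if score <= lower:
        return 1
    if score <= upper:
        return 2
    return 3


# The minimum, median, and maximum scores reported above.
for value in (0, 30.5, 450):
    print(value, gensini_category(value))
```

Once categorized, scores of 25 and 68 fall into the same group, which illustrates the loss of information that comes with this approach.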
The results of fitting the generalized ordered logit model are presented in Table 3. Because the cumulative ORs are estimated separately at each category of the Gensini score, it is difficult to draw a general conclusion about the effect of the covariates. Regarding Factor V Leiden, it was difficult to draw a conclusion about the effects; however, the nonsignificant results for Gensini scores > zero versus a zero Gensini score might reflect that heterozygote or homozygote mutant genotypes were not associated with the occurrence of CAD as compared to the wild genotype (p value = 0.375 and p value = 0.488) and that they played a role only in the severity of CAD. Despite the complexity of this model, a trend toward a higher Gensini score and, thus, more severe CAD with Factor V Leiden was observed by moving from a wild genotype to a heterozygote (OR = 1.81 and OR = 2.52) or a homozygote mutant (OR = 3.18 and OR = 5.87).
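The difference between the proportional odds model and the generalized ordered logit model can be stated in one pair of equations. The notation is an assumption introduced here for illustration: Y is the Gensini score group coded 0 to 3, x the covariate vector, and j indexes the three cumulative splits. Sign conventions and parameterizations vary across software packages.

```latex
\begin{align}
\text{Proportional odds:} \quad
\operatorname{logit} \Pr(Y > j \mid \mathbf{x})
  &= \alpha_j + \mathbf{x}^{\top} \boldsymbol{\beta},
  & j = 0, 1, 2, \\
\text{Generalized ordered logit:} \quad
\operatorname{logit} \Pr(Y > j \mid \mathbf{x})
  &= \alpha_j + \mathbf{x}^{\top} \boldsymbol{\beta}_j,
  & j = 0, 1, 2.
\end{align}
```

Under proportional odds, a single cumulative OR, exp(β), applies at every split. When the assumption fails, each covariate has one OR per split, exp(β_j), and the split at j = 0 corresponds to the comparison of Gensini scores > zero versus zero.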
Two-part model: Factor V Leiden was not associated with the occurrence of CAD in the unadjusted model (p value = 0.270). Based on the univariable analysis, we found that the effect of hypertension (p value = 0.230) in the occurrence part and family history of CAD (p value = 0.363) in the severity part needed no adjustment. Table 4 shows the unadjusted and adjusted associations between Factor V Leiden and the occurrence of CAD as well as the severity of disease in the CAD group. Accordingly, Factor V Leiden only had a statistically significant effect on the severity of CAD even after adjustment for the effect of the other covariates (p value < 0.001) and not on its occurrence (OR = 1.25, 95% CI: 0.68-2.31 and OR = 1.42, 95% CI: 0.46-4.43 for heterozygote and homozygote mutant, respectively). In other words, the logarithm of the Gensini score increased by 0.44 and 0.70 by moving from a wild genotype to a heterozygote and a homozygote mutant in the CAD group, respectively.
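Because the severity part is modeled on the logarithm scale, exponentiating a β estimate gives a multiplicative effect on the original scale, usually read as a ratio of geometric mean Gensini scores. The sketch below back-transforms the estimates and 95% CIs reported in this article. It assumes the natural logarithm was used, which the article does not state explicitly, and the helper name `to_ratio` is hypothetical.

```python
import math

# Severity-part estimates on the log scale, as reported in the article:
# (beta, lower 95% limit, upper 95% limit).
estimates = {
    "heterozygote": (0.44, 0.20, 0.69),
    "homozygote mutant": (0.70, 0.28, 1.12),
}


def to_ratio(beta, lower, upper):
    """Exponentiate a log-scale estimate and its confidence limits.

    ASSUMPTION: the natural logarithm was used in the severity part.
    """
    return tuple(round(math.exp(value), 2) for value in (beta, lower, upper))


for genotype, values in estimates.items():
    print(genotype, to_ratio(*values))
```

Under that assumption, the heterozygote estimate corresponds to a Gensini score roughly 1.55 times that of the wild genotype among individuals with CAD, and the homozygote mutant estimate to roughly a doubling.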

Discussion
In the present study, the ordinal threshold and two-part models were applied to simultaneously assess the association between Factor V Leiden and the occurrence and severity of CAD using a semi-continuous Gensini score. Using the ordinal threshold model, zero values were considered as one group and the other values were classified. We found that the PO assumption was not met in our data, so separate estimates for the aforementioned relationship in the different classes of the Gensini score were presented. Several estimates for the cumulative ORs provided a complicated model, which makes it difficult, especially for clinicians, to come to a clear conclusion. As Min and Agresti 1 mentioned, arbitrary selection of the cut points and the number of categories as well as loss of information due to categorizing a continuous variable are among other disadvantages of this approach. Using this model, we found it difficult to arrive at a conclusion regarding the association between Factor V Leiden and the severity of CAD since it varies in the different categories of the Gensini score. Nevertheless, a trend from a nonsignificant association with the occurrence of CAD toward a stronger association with the severity of CAD was observed by moving from a wild genotype to a heterozygote or homozygote mutant. A more appropriate approach, the two-part model, which considers two processes for the occurrence and severity of disease, was then applied. The comprehension and interpretation of this approach are straightforward, especially for clinicians. The two-part model was preferred to its Tobit or Heckman sample selection counterparts since the Tobit model assumes that covariates have the same influence on the occurrence and severity of CAD. 5 However, as was observed in our data, this might not always be the case. Also, we avoided using the Heckman sample selection model because zeros in the Gensini score are actual values and not missing or censored data. 11,12 Findings based on the two-part model clearly revealed that Factor V Leiden was not significantly associated with the occurrence of CAD either before or after adjustment for the effect of the other covariates. Nevertheless, when CAD was present, this factor was associated with its severity, such that as compared to the wild genotype, heterozygote and homozygote mutant genotypes were associated with an increase in the Gensini score values, indicating more severe CAD. Although the trend was also shown using the generalized ordinal threshold model, it was based on the estimation of a number of ORs, whereas it was very easily concluded through a single β estimate for each genotype from the two-part model. Another salient point is that the number of estimated ORs depends on the arbitrary selection of the number of cut points. In an earlier analysis of these data based on the vessel score, Boroumand et al. 9 reported the existence of an association between Factor V Leiden and both the occurrence and the severity of CAD; however, using a more informative scoring system, we did not observe this relation for the occurrence of CAD. This can be explained by the difference in the classification of CAD patients. In the current study, the non-CAD group was considered as individuals with a zero Gensini score (normal subjects), whereas Boroumand et al. 9 considered normal and minimal (< 50% luminal stenosis) individuals as the non-CAD group. Therefore, individuals with a small Gensini score who might be in the minimal classification of the vessel score were considered in the CAD group in the present study. Hence, it appears that one should be cautious about combining individuals with minimal coronary stenosis with normal subjects. Chang et al. 13 concluded that only the PO assumption played a role in selecting between the two-part and ordinal threshold models and that only in the case of PO assumption failure was it possible for the predictors of zero values to be different from those of other values; otherwise, there was no priority in choosing between these two approaches. However, in our opinion, the nature of the data is very important. In our data, zero values for the Gensini score represent normal individuals without luminal stenosis, and separating them from the others seems rational when investigating the effect of covariates or adjusting for them to study a relationship of interest.
In this study, we used the logit link function for the occurrence part and the logarithm of the Gensini score for the severity part because the results are easy to interpret. The link function for the occurrence part is usually limited to logit or probit; however, attempts have been made to consider other distributions, such as gamma or log-skew-normal, for the severity part of the model. 14,15 Another important issue worth noting is that since longitudinal and repeated measurements designs are common in medical studies, statisticians have recently focused more on providing suitable methods for modeling semi-continuous outcomes in these situations, usually by including random effects in the models. 16-20 In this study, we focused on the Gensini score; however, interest in other scoring systems such as the SYNTAX score 21 has grown recently. The Gensini score is a good example of a semi-continuous outcome in clinical studies. A relatively high proportion of zero values and a right-skewed distribution might be frequently observed with other outcomes or scoring systems, and using proper statistical approaches to analyze such data will lead to more precise results.
One of the limitations of this study was that we could not readily compare the applied models using routine measures such as the Akaike information criterion (AIC) or Bayesian information criterion (BIC) because this requires using all the data in both models. In the two-part model, the occurrence part uses all the data and the severity part uses just the non-zero data, which leads to the estimation of the pseudo-likelihood rather than the likelihood, whereas the ordinal threshold model provides the likelihood. Therefore, we avoided this comparison, which might be studied more in further research.
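For reference, the two criteria mentioned above have simple standard definitions, shown here as a short sketch. The log-likelihood and the number of parameters in the usage lines are hypothetical placeholders and are not results from this study; only the sample size of 1594 is taken from the article.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln(L)."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion: k ln(n) - 2 ln(L)."""
    return n_params * math.log(n_obs) - 2 * log_likelihood


# HYPOTHETICAL log-likelihood and parameter count, for illustration only.
print(aic(log_likelihood=-100.0, n_params=5))
print(bic(log_likelihood=-100.0, n_params=5, n_obs=1594))
```

Both criteria depend on a log-likelihood evaluated on a common set of observations, which is why the differing data use in the two models is an obstacle to a direct comparison.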

Conclusion
We conclude that, besides the loss of information caused by categorizing a semi-continuous outcome, violation of the PO assumption complicates arriving at a clear decision, especially for clinicians. Therefore, paying more attention to more straightforward models such as the two-part model is recommended when analyzing such outcomes.